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Abstract 



Reduced-rank decompositions provide descriptions of the variation among the elements of a 
matrix or array. In such decompositions, the elements of an array are expressed as products of 
low-dimensional latent factors. This article presents a model-based version of such a decompo- 
sition, extending the scope of reduced rank methods to accommodate a variety of data types 
^^ such as longitudinal social networks and continuous multivariate data that are cross-classified 

Cd by categorical variables. The proposed model-based approach is hierarchical, in that the latent 

I I factors corresponding to a given dimension of the array are not a priori independent, but ex- 

,__! changeable. Such a hierarchical approach allows more flexibility in the types of patterns that 

> can be represented. 

<N 

^^ Some key words: Bayesian, multiplicative model, PARAFAC, regularization, shrinkage. 
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,3 1 Introduction 



Matrix-valued data are prevalent in many scientific disciplines. Studies in social and health sciences 
often gather social network data that can be represented by square, binary matrices with undefined 
diagonals. Numerical results from gene expression studies are recorded in matrices with rows 
representing tissue samples and columns representing genes. Analysis of stock market returns 
involves data matrices with rows representing stocks and columns representing time. With such 
data there are often dependencies among both the rows and the columns of the data matrices, and 
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so the standard tools of multivariate analysis, in which patterns along one dimension of the data 
matrix are thought of as i.i.d., may be inadequate for data analysis purposes. As an alternative to 
the i.i.d. paradigm, patterns of row and column variation in matrix-valued data are often described 
with reduced-rank matrix decompositions and models. For example, the i,jth entry of an mi x 7712 
matrix might be expressed as yij = (uj, Vj) + e^.j, where the heterogeneity among a set of low- 
dimensional vectors {ui, . . . , Umi} and {vi, . . . , Vm2} is used to represent heterogeneity attributable 
to the row and column objects respectively. Such models can be described as being bilinear, as 
the expectation of yi,j is a bilinear function of the parameters. These models are related to biplots 



Gabriel, 1971 , bilinear regression [Gabriel 1998 and the singular value decomposition (SVD). 



In more complex situations the data take the form of a multidimensional array instead of a 
matrix. For example, temporal variation in a social network over a discrete set of time points may 
be represented by a three-way array Y = {yij,t}, where yi,j,t describes the relationship between 
nodes i and j at time t. Similarly, gene expression data gathered under a variety of experimental 
conditions, or multiple variables measured on a set of companies over time are also examples of 



array- valued or multiway data. Surveys of multiway data analysis include Coppi and Bolasco 1989 



and Kroonenberg [2008 . The July- August 2009 issue of the Journal of Chemometrics was dedicated 



to Richard Harshman, one of the founders of three-way data analysis. Harshman iHarshman, 1970 



Harshman and Lundy, 1984 developed a three-way generalization of the SVD known as "parallel 



factor analysis", or PARAFAC, that has become one of the primary methods of multiway data 
analysis. While the SVD represents the i, jth element of a rank-i? matrix A as ajj = (uj, Vj) = 
Ylr=i'^i,r''^j,r^ the PARAFAC decomposition of a three-way array represents the i,j,kth. element 



as a. 



i,j,k 



Kruskal 


1976 


1977 



related such decompositions to a 



precise definition of rank for three-way arrays, in which the rank is the smallest integer R for which 
the above representation holds. The generalization to arbitrary dimensions is straightforward: A 
il'- dimensional array of rank R is one in which the elements can be expressed as a multilinear 



function of i?- dimensional factors. A compact review of these results and others appears in Kruskal 



1989 



While the area of multiway data analysis has been active, most of the focus has been on al- 
gorithms for finding least-squares solutions, pre- and post-processing of results, and interpretation 
of the least-squares parameters. Little has been done in terms of incorporating multilinear repre- 



sentations into statistical models. One exception is the work of Vega-Montoto and Wentzell 2003] 



and Vega-Montoto et al. 2005 , who develop algorithms for finding maximum likelihood solutions 
for situations with heteroscedastic or correlated error terms. However, these algorithms assume the 
error variance is known. 

This article develops a hierarchical multilinear model for incorporation into a variety of non- 
standard multiway data analysis situations, and presents a Bayesian approach for parameter es- 
timation. The motivation is twofold: First, multilinear array representations can involve a large 
number of parameters. Overfitting of the model can be ameliorated by using shrinkage estimators 
provided by a Bayesian approach. In particular, a hierarchical Bayesian approach can be used to 
provide shrinkage patterns that are based primarily on the observed data, rather than relying heav- 
ily on a fixed prior distribution. The second motivation is that Bayesian approaches and MCMC 
estimation methods allow one to incorporate the basic multilinear representation into models for 
complex data that might involve additional dependence structures or discrete data. 

After presenting the hierarchical multilinear model and Bayesian methods for estimation in 
Sections 2 and 3, a small simulation study is presented in Section 4 to compare mean squared errors 
of three different parameter estimation methods: least-squares, a simple non-hierarchical Bayesian 
approach and a Bayesian hierarchical approach. The Bayes estimators are found to outperform the 
least-squares estimator, with the hierarchical Bayes procedure giving the best performance. The 
performance of the estimators when the rank of the model is misspecified is also considered. In this 
situation, the least-squares and non-hierarchical Bayes procedures increasingly overfit the data as 
the rank is increased, while the hierarchical Bayes procedure is robust to rank misspecification. 

Sections 5 and 6 give examples in which it is useful to embed a multilinear model within a 
larger model for observed data. Section 5 considers estimation of a multivariate mean E[yx] = fj,^ 
for each possible value of a vector of categorical variables x. Often the number of observations 
per level of x is small and varies from level to level. A hierarchical model for the mean, /x^ ~ 
multivariate normal(/3j,, S), allows for consistent estimation of each fi^ but shrinkage towards (3^ 
when the sample size is small. The values B = {(3^ : x G X} can be represented as a multiway 
array, and a reduced rank multilinear model for B allows for the modeling of non-additive effects 
of X with a relatively small number of parameters. 

Section 6 presents an analysis of international cooperation and conflict during the cold war. 



The data consist of a three-way array with element Vij^t representing the relationship between 
countries i and j in year t. Several features of these data make existing tools from multiway data 
analysis inappropriate, one being that the data are ordinal. The range of the data includes the 
integers from -5 to 2, indicating different levels of military cooperation or conflict. Assuming that 
the ?/ij,fc's are normally distributed or even continuous would be inappropriate. However, using the 
tools developed in this article it is reasonably straightforward to embed a multilinear representation 
within an ordered probit model for these data. A discussion of the results and directions for future 
research follows in Section 7. 

2 Reduced rank models for array data 

In this section we review the reduced rank model and an alternating least-squares (ALS) procedure 
for parameter estimation. For a review of the properties, limitations and alternatives to ALS, see 
Tomasi and Bro [2006 and Chapter 5 of Kroonenberg [2008 . 



2.1 Rank and factor representations for arrays 

Given an mi x 7112 data matrix Y it is often desirable to separate out the "main features" of Y 

from the "patternless noise." This motivates a model of the form Y = + E, where is to be 

estimated from the data. Interpreting "main features" as those that can be well-approximated by 

a low-rank matrix, the rank of is usually taken to be some value R < mi A ?7i2- The rank of a 

matrix can be defined as the smallest integer R such that there exists matrices U G j^^ix^ and 

VgM™2X-R such that 

R 

= ^ Ur (g) Vr = UV^, 

r=l 

where u,. is the rth column of U, or equivalently 

R 

T-=l 

where Uj is the ith row of U. Variation among the rows of U represents the heterogeneity in 
attributable to variation in the row objects, and similarly variation among the rows of V represents 
heterogeneity attributable to the column objects. 

A i^-order multiway array Y with dimension m-i x • • • x m^x has elements {yii,...,jj^ : ik £ 
{1, . . . , m-fc}}. As with a matrix, we may define a model for a i('-order array as Y = + E, where 



E is an array of uncorrelated, mean-zero noise and is a reduced rank array to be estimated. 
Following Kruskal [1976 and Kruskal [1977] , the rank of a i^-order array is simply the smallest 



integer R such that there exist matrices {U^ ' G M™fc^^^ k = 1, . . . , K}, such that 

= ^u«$D---C3uW = (uW,...,uW), 

r=l 

where u}. is the rth column of U*^ -*, or equivalently 



Oii,...,tK — Z_^^h,r ^ ^ ^iK,r — \"ii ' '^iif /' 

r=l 

where u^ is the ith row of U' ' . As in the matrix case, variation among the rows of U^ ' represents 
heterogeneity attributable to the A;th set of objects, that is, the kth. mode of the array. 

2.2 Least squares estimation 

In the matrix case the least squares estimate of = UV (also the MLE assuming normal, i.i.d. 
errors) can be obtained from the first R components of the singular value decomposition of Y. For 
arrays of higher order, only iterative methods of estimation are available. Perhaps the simplest 
method of parameter estimation is the alternating least squares algorithm (ALS), in which factors 
corresponding to a given mode are updated to minimize the residual sums of squares given the 
current values for the other modes. In this subsection we review the relevant calculations for ALS, 
which will also be useful for Bayesian estimation in the next section. 

Estimation for a three-way model: We begin with a three-way array so that the main ideas 
can be understood with a minimal amount of notational complexity. Let Y be a three-way array 
modeled as yij^k = (uj, Vj, vi^fc) + eij^k, with {eij^k} ~ i-i-d. normal(0,o"^). We can write 

yi,;k = V(ui o wfc) + ei,.,fc 
y-j,fc = U(vj o wfc) + e.j,fc, 

where U, V, W are mi x R, 1712 x R and 7713 x R matrices respectively, Uj, Vj, w^ are rows of these 
matrices, Yij^., Yi^.^ki y-,j,k ^^^ vectors of length mi, m,2 and ms, and "o" denotes the Hadamard 



product (elementwise multiplication). Some matrix algebra and careful summation shows that, as 
a function of U, p(Y|U, V, W) can be written 

p(Y|U,V,W) ex etr(U^L/a2-U^UQ/[2cj2]) , where (1) 

Q = (V^V)o(W^W) and 

and etr(A) = exp{trace(A)}. With V and W fixed, the conditional MLE and least-squares estimate 
of U is given by U = LQ~ . The ALS procedure is to iteratively replace a current value of U 
with its conditional least-squares estimate, then replace V and W similarly. This procedure is 
then iterated until a pre-specified convergence criterion has been met. Typically, the algorithm is 
replicated beginning with several different randomly generated initial values, with each replicate 
iterated until the relative fit ||Y — 0|p/||/||Y|p is below a user-defined threshold. A comparative 



study of different least-squares estimation methods done by Tomasi and Bro , 2006 concluded that 
the ALS procedure provides a good compromise between computational complexity and quality of 
the solution. 

Estimation for a X-way model: Now suppose Y is an mi x • • • x mx array. Let U' ^ , . . . , U*^ •* 
be the matrices of factors for the K modes, so that U^ ' is an tti^ x R matrix. Generalizing the 
approach for the three-way model, let y;^ = (yi,i2,...,i^, • • • ■,yni,i2,...,im) be a "fiber" along the first 
dimension of the array. Then we can write y;^ = U' ^(u^ o u^ o • • • o u^ ) + ei^. Similar to the 
three-mode case, as a function of U*^ ' , p(Y|U^ \ . . . , U' ') can be written 

P(Y|UW,...,U(^)) oc etr(u(^)^L/a2-U(i)^U(i)Q/[2a2]) , where (2) 

Q = (u(2)V2))o...o(uW^U(^))and 

The conditional MLE and least squares estimator of U' •* given the factor values for the other 
modes is thus U = LQ^ . As with three-way data, the ALS procedure is to iteratively replace 
the factors matrices with their conditional least-squares estimates until convergence. 



3 Bayes and hierarchical Bayes estimation 

Compared to least-squares or maximum likelihood methods, Bayesian procedures often provide sta- 
ble estimation in high-dimensional problems due to regularization via the prior distribution. Using 
conjugate prior distributions, this section provides a Gibbs sampling scheme that approximates the 
posterior distribution p({J^ ' , . . . ,1]^ ' ,a'^\Y), and by extension, an approximation to the posterior 
distribution of = (U'^\ . . . , U*- '). The posterior expectation of can be used as a Bayesian 
estimate of the main features of the data array. 

3.1 A basic Gibbs sampler 

Let the prior distribution for U*- ' be such that that the rows of U' ' are i.i.d. multivariate 
normal(/Xfc, ^k) or equivalently, U^ ' ~ matrix normal(Mfc = 1/x^, '^fc,I) with density 

P(U('=)) oc etr(-(uW-Mfcf(u('=)-Mfc)*-V2) 
oc etrCU^'^^^Mfc*-! -U(^')^U('=)*^V2). 

Combining this with the likelihood from Equation^ it follows that if U*^ ^ ~ matrix normal(Mi, 'J'l, I) 
a priori, then the full conditional distribution is also matrix normal with density 

KU(1)|Y,U(2),...,U(^)) oc etr(-(uW-Mif(uW-Mi)/2)*^^) 

*i = (Q/(t2 + *-!)-! 
Ml = (L/cr^ + Mi*^-^)*!. 

Full conditional distributions for U^ ',■■■, \J^"^' are derived analogously. Using a conjugate inverse- 
gamma(fo/2, voaQ/2) prior distribution for a'^ results in an inverse- gamma(a, b) full conditional 
distribution where a = (i^o + Uk'^k)/'^ and 6 = (z^oo-q + HY- (U^^), . . . , U(^))|p)/2. 

A Markov chain Monte Carlo approximation to p{V^ ',. . .,1]^ \o"^|Y) can be made by iter- 
atively sampling each unknown quantity from its full conditional distribution. This generates a 
Markov chain, samples from which converge in distribution to p(U' ' , . . . ,\J^ ' , (t^|Y). However, 
it would be inappropriate to estimate U^ ' by its posterior mean U , or with (U , . . . , U ) , 
as the values of the latent factors are not separately identifiable. For example, the likelihood is 
invariant to joint permutations and complementary rescalings of the columns of the U' ^'s (see 
Kruskal [1989 for a discussion of the uniqueness of reduced-rank array decompositions). Instead, 



the posterior mean estimate of 0, obtained from the average of (U^ V . . , U^ -*) over iterations 
of the Markov chain, can be used as a point estimate of 0. If desired, point estimates of the U^ ''s 
can then be obtained from a rank-i? least-squares approximation of 0. 

3.2 Hierarchical modeling of factors 

Rarely will we have detailed prior knowledge of an appropriate mean /x^ and variance ^k for 
each factor matrix U*^ ' . Absent these, we may consider a simple "weak" prior distribution such as 
u\ , . . . , Uml ~ i.i.d. multivariate normal(0, r^I), where r^ is large. However, doing so would ignore 
patterns of heterogeneity in the U^ ^'s that could improve estimation of 0. Even though the value 
of is invariant to certain sign changes or permutations of the column of {U^ ' ,k = 1, . . . , K}, 
other patterns in the U' ^'s manifest themselves as patterns in and Y. 

To illustrate this, recall that the factors represent variance among the elements of the data 
array Y that can be attributed to heterogeneity within the various modes. Consider three mode 
data in which the first mode represents a large number of experimental units and the other two 
modes represent two sets of experimental conditions. In this case, yij^k is the measurement for unit 
i when condition one is at level j and condition two is at level k. Letting the factors corresponding 
to the three modes be U, V and W, modeling the rows ui, . . . , u^i of the mi x R factor matrix 
U as i.i.d. multivariate normal(/x, ^) induces a covariance among the elements of each unit-specific 
m2 X m^ matrix Yj = {yij^k, 1 ^ J ^ ^2, 1 < ^ < "^s}) given by the following calculation: 

Uj = fi + 7j, 7j ~ multivariate normal(0, ^) 

yi,j,k = {ui,Vj,Wk) +€ij^k 

= uf(vj owfc) + eij,fc = m'^Cvj o wfe) -F7f(vjO wfc) -heij,fc (3) 

Cov[yij^k, yi,i,m] = E[yJ{vj o Wk){-vi o Wm)^7j] + ct^I 

= tr((vj o wfc)(v/ o Wm)"^*) -Fcr^I 

= tr([(v,vf)o(wfcW^)]*) + a2l. (4) 

Each unit has a measurement under conditions (j, k) and under (/, m), and the correlation of these 
measurements across experimental units is determined by 'Vj\rf , ■w,fcW^ and the covariance matrix 
^. Additionally, Equation [3] indicates that the scale of n relative to that of ^ represents how much 
variability there is among the units. Fixing /x or ^ in advance places restrictions on these variances 



and correlations. This suggests the use of a hierarchical model as an alternative, whereby the mean 
and variance of the factors of each mode are estimated from the observed data. Returning to the 
general case of K modes, the proposed hierarchical model is as follows: 

{vl\ , . . . , u^|} ~ multivariate normal(/X;j, ^k) 

^lJ^^^k ~ multivariate normal(//o, *fc/'^o) 
^k ~ inverse- Wishart( So, z^o)- 

Readers familiar with factor models for matrices (the case oi K = 2) may be concerned about 
the non-orthogonality of the columns of the latent factor matrices in the above model. In the 
matrix case, the mean matrix for Y is given by = U' ^U*^ ' . Letting U = U^ ^H, A; = 1, 2 we 
see that = U U for any orthonormal matrix H. This invariance to rotation in the matrix 
case, however, does not generalize to rotation invariance for multilinear representations of arrays. 



Kruskal 1977 shows that other than some elementary invariances (such as a common relabeling of 
the columns of all the factor matrices), multilinear factor representations are generally rotationally 
unique. 

Diffuse priors can be used as a default, such as /^g = 0, kq = 1, i^o = -R+ 1 and Sq = Itq , where 
Tq is some pre-specified value determined by the scale of the measurements. As an alternative, unit 



information prior distributions fKass and Wasserman 1995 can be used, which weakly center the 



prior parameters around estimates obtained from the data. For example, Tq could be obtained 
as the variance of latent factor estimates obtained from a rank-i? least squares approximation to 
Y, and the prior distribution for o"^ could be weakly centered around the corresponding residual 
variance. In either case, the full conditional distributions for all parameters have straightforward 
derivations, and are summarized in the following Gibbs sampling scheme: Given current values of 
{U^^-', . . . , U*^ •*} and a^, new values of these parameters are generated as follows: 

1. For each k € {1, . . . , K} in random order, 

(a) sample *fc ~ inverse- Wishart([U('')^U(^) + Ir^]~\ rn-fe + R+l); 

(b) sample ^i^ ~ multivariate normal(U^ •* \/[mk + 1], '^k/l^^k + 1]); 

(c) sample U' ' ~ matrix normal(Mjt, ^fc, I), where 

. *fc = (Qfc/a2 + *,!)-!, and 



2. Sample o"^ ~ inverse-gamma(Po/2, t'o^o/^)' where 
• t'o = t'o + nfc=i "^fe' ^^^ 

Note that {{pj., ^k), k = 1, • • • , K} wiU not be separately identifiable since, for example, the scales 
of {U^ ',k = 1, . . . ,K} are not separately identifiable. However, a non-hierarchical Bayesian ap- 
proach restricts the overall scale of 0, as well the shrinkage point for the U^ •''s. In contrast, the 
hierarchical model allows these things to be determined by the data. 

4 Comparison of estimators 

This section presents the results of some simulation studies comparing the performance of the 
hierarchical Bayes procedure to ALS estimation. In the first study, one-hundred random ©-arrays 
were generated, each having dimension rrii x m2 x 7723 = 10 x 8 x 6 and rank R = 4, and each to 
be estimated from a corresponding "observed" data array Y. Letting R = m2 x 771,3, the and Y 
arrays were generated as follows: 

1. For each mode k G {1, 2, 3}, 

(a) sample ^k as follows: 

i. sample *o ~ Wishart (I, ^ -|- 1), 
ii. set uq = R + X where x ~ Poisson(v ^), 
iii. sample ^k ~ inverse-Wishart('J'O) ^0); 

(b) sample ^p, ~ multivariate normal (0, ^fc)j 

~ (k) 

(c) sample U '^ multivariate normal (/ij^., ^fc)- 

2. Let be the rank-i? least-squares approximation to (U , U ,U ), but rescaled so that 
the average squared magnitude of the elements J2 Of j j^/ imim2m3) is 1. 

3. Set Y = -F E, where {ejj- fc} ~ normal(0, 1/4). 
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We now go through the rationale for this simulation scheme. Working backwards, in steps 2 and 
3 the error variance for E is set to be 1/4 of the average squared magnitude of the elements of 
0. This makes estimation of © feasible but not trivial. In steps 1 and 2, we first generate an 
array having a maximal rank R, and then let be its rank-4 least-squares approximation. The 
rationale for this is to make the generated arrays somewhat different in distribution from the 
prior distribution that is used for estimation. If instead the parameter values were simulated from 
the prior distribution used for estimation, we would expect the Bayes procedure to outperform 
the ALS procedure simply because the the Bayes estimates would be a priori weighted towards 
their true values. By generating from a distribution other than the prior, we intend to give 
a more fair comparison between the performance of the Bayesian procedure and ALS estimation. 
Additionally, the "prior" parameters ^o and z^o in steps l.(a) are randomly generated in order to 
provide a broader range of patterns generated in the arrays than could be obtained from fixed 
values of ^0 and uq. 

4.1 Known rank 

We first examine the case where the presumed rank of is equal to the true rank of 4. Two 
estimates were computed for each of the one-hundred simulated 0-arrays: 

©LS (least squares), an estimate obtained via the alternating least-squares algorithm; 

©HB (hierarchical Bayes), a posterior estimate under the hierarchical model and unit infor- 
mation priors described in Section 3.2. 

The least squares estimates were obtained by running the ALS algorithm using twenty different 
random starting values and then selecting the one that gave the minimum residual sum of squares. 
For each starting value, the ALS algorithm was iterated until the magnitude of the change in the 
estimate, relative to the magnitude of the estimate, was less than 10^^. 

The Bayesian estimates were obtained using the Gibbs sampling scheme described in the previ- 
ous section, with 1,000 iterations to allow for convergence to the stationary distribution ("burn-in"), 
followed by 10,000 iterations for estimating the mean matrix. Mixing of the algorithm was assessed 
by monitoring the value of ||©|p across the 10,000 iterations of the Markov chain. Mixing was gen- 
erally good, with the median effective sample size (the equivalent numbers of independent Monte 
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Carlo samples) for ||0|p being 9,422. For each simulated data set we obtained a posterior mean 
estimate of 0. However, this estimate will generally have a rank higher than 4 as rank is not 
preserved under linear combinations. For this reason, the rank-4 least squares approximation to 
the posterior mean was also computed as an alternative Bayesian point estimate of 0. 
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Figure 1: Comparison of MSE and RSS for different estimation methods. 

The results of the simulation study are summarized in FigurejlJ For each data set and estimation 
method, the ratio of ||0 — 0|p/||Y — 0|p was computed to assess the performance of© relative to 
the unbiased estimate Y. In this example where the true rank of is known, using the reduced- 
rank ALS estimate is superior to using Y, giving reductions of mean squared error of roughly 60 
to 80%. However, the first panel of Figure [T] indicates that the Bayesian estimators provide a 
substantial further reduction in MSE, amounting to an additional reduction of 41% on average and 
up to 80% for particular data sets. Also, note that the rank-4 Bayesian point estimate performs 
essentially the same as the posterior mean estimate, even though the latter may be of rank higher 
than 4. 

One possible explanation for the superiority of the Bayesian approach over ALS is that the 
latter does not explore as much of the parameter space as an MCMC algorithm. The second panel 
of Figure 111 which plots the relative residual sum of squares (RSS) ||Y — 0|p/||Y|p for the ALS 
estimate versus the two Bayes estimates, suggests that this is not the case. This plot indicates that 
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©Ls is in fact closer to Y than ©hb for every simulated data set. This observation, together with 
the superiority of the Bayes estimate in terms of estimating 0, suggests that the ALS procedure 
tends to overfit. 

For each of the 100 simulated data sets an alternative Bayesian estimate of © was also obtained, 
in which the elements u^ J of the U-matrices were assumed to be a priori independent normal(0, 100) 
random variables. This non-hierarchical approach fixes the amount of regularization, and does not 
recognize patterns in © that could be represented by correlations among the latent factors. Not 
surprisingly, estimates obtained from this approach generally had higher MSEs than the estimates 
based on the hierarchical model (in 99% of the cases using the posterior mean estimates, and 92% 
of the cases using rank-4 point estimates). 

4.2 Misspecified rank 

A more realistic data analysis situation is one in which the true rank of © is not known. In 
this subsection we investigate the MSEs of ©ls ©hb for estimating the rank-4 arrays generated 
as described above, but when the assumed rank is R £ {1,...,8}. Using the same simulation 
and estimation procedures as described in the previous subsection, a was obtained for each of 
the 100 simulated ©-arrays and for each combination of the two estimation methods and ranks 
R € {1,...,8}. For each of these 100 x 2 x 8 estimates, a relative MSE ||0 - ©||Vl|Y - ©|p 
and RSS ||Y — 0|p/||Y|p was computed as before. The first of these measures the fidelity of the 
estimate to the true underlying parameter, and the second to the the data. 

Summaries of the results are plotted in the four panels of Figure [2j For example, each boxplot 
in the top row of plots summarizes the 100 RSS values of the ALS estimates assuming a given 
rank. As expected, as the rank increases the percentage of the variation in Y explained by the 
ALS estimate goes up and the RSS goes down. However, the first plot of the bottom row shows 
that increasing the rank of the ALS estimate beyond 3 generally increases the MSE. In contrast, 
the MSE of the hierarchical estimate ©hb generally achieves a minimum at the actual rank of 4, 
and increases relatively slowly as the assumed rank is increased beyond 4. This suggests that the 
hierarchical Bayes approach is more robust to overfitting than the least squares method. Since the 
"true" rank of © is generally not known, it may be desirable to fit a model with a moderately 
large rank in the hopes of capturing as much of © as possible. The above results suggest that 
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Figure 2: RSSs and MSEs under different presumed ranks and estimation methods. 

a hierarchical Bayes estimate may be preferable in such situations, as it provides a more stable 
estimate of across different choices of the presumed rank. 

4.3 Rank selection 

We now consider the possibility of estimating the rank R from the observed data array Y. One pop- 



ular model selection procedure is to minimize the Bayesian information criterion, or BIG Schwarz 



1978 



The BIG for a given model and data set y is —2lnp{y\9) +plnn, where 9 is the parameter 
estimate, p is the dimension of 9 and n is the sample size. In practice, the BIG can be computed 
for a range of different models, and the one giving the smallest BIG is selected. This procedure 
favors models that fit well (in terms of likelihood) but penalizes model complexity. 

As pointed out by Pauler 1998|, for hierarchical models the number of parameters can be 



ambiguous. As a remedy, Spiegelhalter et al. 2002 proposed the deviance information criterion 



or DIG which can be computed from output of a Markov chain. The DIG is given hy D+p , where 
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1 


2 


3 


4 


5 


6 


7 


8 


2 


0.10 


0.74 


0.07 


0.05 


0.02 


0.01 


0.01 


0.00 


4 


0.08 


0.15 


0.27 


0.28 


0.06 


0.07 


0.04 


0.05 


6 


0.07 


0.18 


0.19 


0.17 


0.10 


0.08 


0.09 


0.12 



Table 1: Rank selection using DIG: R is the rank under which 100 simulated data sets were 
generated. Entries in the table give the percentage of simulations for which R, the DlC-optimal 
rank, took on the values 1 through 8. 

D is the average value of —2lnp{y\6) across iterations of the Markov chain, and p is the "effective 
number of parameters", given hj p = D + 2lnp{y\0), where 9 is an estimate of 9. For our model 
the parameters are and o"^, and we take our estimates to be the posterior mean of and the 
mean residual error under the posterior mean, respectively. 

For each of the 100 simulated data sets described above we computed the DIG for each value of 
i? G {1, . . . , 8}, and took our "estimate" R of R to be the rank for which the DIG was minimized. 



As shown in the second row of Table 4.3, the true rank of 4 was the most frequently selected value 
of R, followed closely by 3. The fact that R = 3 was selected 27 times is somewhat ameliorated 
by the fact that in 15 of these instances the "best" rank in terms of MSE turned out to be 3 (11 
cases) or 2 (4 cases). 

To further evaluate the BIG procedure, we also reran the entire simulation study when the true 
rank was R = 2 and when it was R = 6. For the case of i? = 2, the DIG selected i? = 2 in 74% 
of the cases, indicating that in this situation the true rank can be identified with a high degree of 
accuracy. Rank selection with DIG was more problematic when the true rank was 6. As we would 
hope, the distribution of ranks selected here is somewhat shifted to the right from the distribution 
of selected ranks when i? = 4, but as indicated in the table, the true rank of 6 can not be identified 
accurately with DIG. However, the DIG is not as bad in terms of obtaining the rank that gives the 
best approximation to the true in terms of MSE. For example, 75% of the 71 simulated data 
sets for which R was less than 6 also attained their minimum MSE at an i?-value less than 6. In 
particular, the seven data sets for which R = 1 also attained their minimum MSE with a rank 1 
model. 
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5 Example: Multiway means for cross-classified data 

Large scale surveys collect data on a variety of numerical and categorical variables. Numerical 
data are often summarized by computing sample averages for combinations of a set of categorical 
variables. For example, letting y be a p-dimensional vector of numerical variables and x a K- 
dimensional vector of categorical variables, interest may lie in the population average of y for a 
given value of x, which is denoted as fi^ £ W. However, if the number of categorical variables 
or their number of levels is large compared to the sample size, then we may lack sufficient data 
to provide stable estimates for each fx^ separately. For example, the 2008 General Social Survey 
includes data on the following six variables: 

• t/i (words): number of correct answers out of 10 on a vocabulary test; 

• 2/2 (tv): hours of television watched in a typical day; 

• xi (deg) highest degree obtained: none, high school. Bachelor's, graduate; 

• X2 (age): 18-34, 35-47, 48-60, 61 and older; 

• X3 (sex): male or female; 

• X4 (child) number of children: 0, 1, 2, 3 or more. 

Complete data for these variables are available for 1116 survey participants. However, there are 
4x4x2x4 = 128 levels of x. More than half of these cells have 5 or fewer observations in them, 
and about 75% have less than 12 observations. As such, an estimator of /i^ that uses only data 
from group x, that is {y^ : Xj = x}, will be subject to a large sampling variance. 

5.1 A multilinear model for group means 

Statistical remedies to this problem typically allow the estimate of n^ to depend on data from groups 
other than that corresponding to x. One such approach is to parameterize the set of multivariate 
means {/x^ : x G X} by a smaller number of parameters. Another approach is via a hierarchical 
model that allows for the shrinkage of set of parameters towards a common group center. Here we 
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consider the following model which has both of these features: 



{y. 


Xi 


= x} 


{7x 


: X 


<^X] 



iid 



iid 



multivariate normal(/ix) S) 
multivariate normal(0, fi). 



(5) 
(6) 
(7) 



Equation [5] indicates that the data within a cell are modeled as multivariate normal, with cell- 
specific means and a common covariance matrix. Equations [6] and [7] express each /x^ as equal 
to a "systematic" component (3^ plus patternless noise 7^. The collection {/3x : x G X} can be 
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Figure 3: Marginal distributions of vocabulary score and television hours watched for different 
levels of degree, age, sex and number of children. 

represented as an m-i x • • • x mx x p array B, where mk is the number of levels of categorical 
variable x^. These values are not separately estimable from the noise 7^ unless we assume B lies 
in a restricted subset of the set of arrays of this size, such as the set of rank-i? arrays. In this 
setting, where one of the modes of the array represents variables and each other mode represents 
the different levels of a single categorical variable, it is useful to express the array decomposition 



17 



as follows: 

B = (U(^) , . . . , U(^) , V) , or equivalently 
/3, = V(u(\)o...ouW). 

The equations above describe a hierarchical model in which the heterogeneity among {fx^ : x G X} 
is centered around a low-dimensional array B = {(3^ : x G X}. Such a model is similar to 



representing an interaction term in an AN OVA with a reduced rank matrix Tukey, 1949, Boik 



1986 , 1989 . However, the hierarchical approach used here allows for consistent estimation of each 
/^x) but shrinks towards the lower-dimensional representation B when data are limited. 

Estimation for this model can proceed as described in Section 4 with a few modifications. As 
before, a Gibbs sampler can be used to approximate the posterior distribution of the unknown 
parameters. Using a conjugate inverse- Wishart prior distribution for S and the other prior distri- 
butions as in Section 4, one iteration of the Markov chain is as follows: 

1. sample 5] ~ p(5]|{yj : i = 1, . . . , n}, {/x^ : x E X}), an inverse- Wishart distribution; 

2. sample /x^ ~ Pit^xliYi ■ ^i — ^Ij/^xi ^)^ ^ multivariate normal distribution for each x £ X; 

3. sample ft ~ p(ft\{^^, (3^ : x G Af}, V), an inverse- Wishart distribution; 

4. iteratively sample {U^ ' ,k = 1, . . . , K} as in Section 3; 

5. sample V ~ p(V|U, {/x^ : x G X}, ft), a matrix normal distribution. 

Derivations of the full conditional distributions are straightforward and are available from the author 
and in the computer code available at the author's website. Provided here are a few comments 
that describe some of the calculations: Let the model for the p x R matrix V be such that the R 
columns are i.i.d. multivariate normal with a zero mean vector and covariance equal to ft. Doing 
so links the scale of the factor effects for fj.^ to the scale of the across-group differences "f^. Writing 
/i^ = n^^^Vx and V = ft-^/'^V, we have 

/ix = V(u«o...ouW)+7, , with 
{Tx} ~ multivariate normal(0, 1). 

From this, we see that sampling from the full conditional distribution of U*^ V ■ • > U' ' can be done 
just as in Section 3.2, with o"^ replaced by 1 and the observed array data replaced by the values 
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of the array defined by {p.^ : x G X}. Similarly, the full conditional of V is the matrix normal 
distribution from Section 3.1, again with a"^ replaced by 1 and {^-^ : x G X} taking the place 
of the observed array data. A value of V can be generated from its full conditional distribution 
by sampling V from this matrix normal distribution and then setting V = f2 ' V. Finally, note 
that the inverse- Wishart full conditional distribution of for ft depends on V. If we have ft ~ 
inverse- Wishart(riQ ,r]o) then the full conditional distribution of CI is inverse-Wishart(ri^ ,r/i) 
where r]i = 7]o + R + nf=i mj, and fJi = fio + V^V + J2^{fi^ - /3x)(/^x - /^x)^- 
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Figure 4: Posterior estimates of factor scores, along with the amount of shrinkage as a function of 
cell-specific sample size. 

5.2 Posterior analysis of GSS data 

We now discuss posterior inference for the GSS data based on the above model and estimation 
scheme. The numerical variables yi (words) and 2/2 (tv) were first centered and scaled to have zero 
mean and unit variance. Prior distributions for the covariance matrices SI and ft were taken to 
be independent inverse- Wishart distributions with p + 1 = 3 degrees of freedom each and centered 
around the sample covariance (correlation) matrix of {yj,i,yj,2j^ = 1, • • • ,n}. Doing so gives these 
prior distributions an empirical basis while still keeping them relatively weak. Such priors are 
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similar to the "unit information" prior distributions described in Kass and Wasserman 1995 . A 
rank-2 model for the array of means was used so that the estimated factor effects could represented 
with a simple two-dimensional plot. 

The algorithm described above was used to construct a Markov chain consisting of 22,000 itera- 
tions, the first 2,000 of which were discarded to allow for convergence to the stationary distribution. 
Parameter values were saved every 10th iteration, leaving 2,000 saved values for Monte Carlo ap- 
proximation. Mixing of the Markov chain was examined by inspecting the sequences of saved values 
of S, ft and the average value of {f3yr} across levels of x. The effective sample sizes for these pa- 
rameters were all over 1,000. Some summary descriptions of the resulting posterior estimates are 
shown in Figure |4J The first panel plots point estimates of the latent factors V and U' ^, . . . , U' ' . 
These were obtained as follows: A posterior mean array B was obtained from the 2,000 saved 
values of B from the Markov Chain. This array is not quite a rank-2 array, as rank is not gen- 
erally preserved under array addition. An alternating least-squares algorithm was performed on 
B to obtain a rank-2 point estimate B and a multiplicative decomposition in terms of matrices 
V, U^^\ . . . ,U^^\ The difference between B and B was small, with ||B - B||Vl|B|p = 0.00011. 



These point estimates of the latent factors are shown in the first panel in Figure |4j For example, the 
matrix U represents the multiplicative effects of deg, and consists of a two-dimensional vector 
for each level of this variable. These vectors are plotted in the figure with "deg.l" representing no 
degree, "deg. 2" a high school degree, and so on. Similarly, the matrix V has a two-dimensional 
vector for each of the two numerical variables. To interpret the figure, note that the estimated 
mean for either numeric variable in any cell can be obtained by coordinate-wise multiplication and 
then addition of the latent factor vectors. For example, the proximity of the "words" vector to the 
"deg. 3" and "deg. 4" vectors indicates that these two groups have higher mean vocabulary scores 
than the other two degree categories. Similarly, the close proximity of the "child. 1", "child. 2" and 
"child. 3" vectors indicates lack of heterogeneity in the means for three of these four categories 
across levels of the other x- variables. Finally, note that some care should go into interpreting the 
figure, as the array B = (U^ -',... ,U*^ ; V) is invariant to certain transformations of the factors. 
For example, multiplying either the first or second column of each of an even number of factor 
matrices by -1 does not change the value of B. 

The second plot in Figure H highlights how the estimated cell means {fi^} differ from the 
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empirical cell means {y^} as a function of sample size. This plot indicates what we would expect 
from a hierarchical model: The difference between estimated cell mean and empirical cell mean 
decreases with increasing sample size. A cell with a large sample size will have Yx ~ Ax ) whereas a 
cell with a small sample size will have an estimated mean fi^ shrunk towards the reduced-rank value 
(3-^. Note that without the multiplicative effects in Equation m of the hierarchical model, the cell 
means would all be shrunk towards a common vector, regardless of the value of x. In contrast, the 
hierarchical multiplicative effects model allows cell-specific shrinkage, as estimated by the reduced 
rank array B. 

An alternative approach to the analysis of these data might involve MANOVA or a hierarchical 
model similar to the one above but in which P^ is parameterized in terms of additive effects, so 

(1) (K) (k) 

that /3x = Ux/ -|- • • • -|- Uxi^ with each ui^. G M^. Such additive models have representations 
as multilinear models, although of course they are restricted to be additive. For comparison, an 
additive MANOVA model was fit and the average value of (Yx^/^x)^ ^^^ computed, measuring the 
lack-of-fit of the additive model. This value was about the same as the corresponding value for the 
multilinear model for the tvhours variable, but 15% larger for the words variable. This indicates 
that some patterns among the cell means for words cannot be represented with an additive model. 
In general, we may expect that some aspects of the heterogeneity among the n^s will not be 
additive. In such situations, it may be preferable to use a multiplicative model whose complexity 
can be controlled with the choice of the rank R rather than to have to consider the inclusion and 
estimation of a variety of higher-order interaction terms. 

6 Example: Analysis of longitudinal conflict data 

The theory of the Kantian peace holds that militarized interstate disputes are less likely to occur 
between democratic countries. Ward et al. [2007 evaluate this theory using international coopera- 



tion and conflict data from the cold war period. The data include records of militarized conflict and 
cooperation every five years from 1950 to 1985, along with economic and political characteristics of 



the countries. In this section we analyze a subset of the data from Ward et al. 2007 . These data 
include cooperation, conflict and gross domestic product data (gdp) for each of m = 66 countries 
every fifth year, t £ {1950, 1955, . . . , 1980, 1985}. Additionally, each country in each of these years 
has a polity score, measuring the level of openness in government. A positive polity score is given 
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to democratic states, while a negative score is given to authoritarian states. 

The cooperation and conflict data form a three-way array with two modes representing country 
pairs and one mode representing time. In this section we will fit an ordered probit model of 
cooperation and conflict data as a function of gdp and polity. Specifically, for each unordered pair 
{i,j} of countries and each time t, our data are as follows: 

yi,j,t £ {—5, —4, . . . , +1, +2}, indicating the level of military cooperation (positive) or conflict 
(negative) between countries i and j in year t; 

Xi,j,t,i = loggdpj + loggdpj, the sum of the log gdps of the two countries; 

Xi,j,t,2 = (loggdpj) X (loggdpj), the product of the log gdps; 

Xi,j,t,2, = polity^ X polity^-, where polity^ G {-1,0,+!}; 

Xi,j,tA = (polity^ > 0) X (polity^ > 0). 

The sample space for yij^t is ordered but the scale is not meaningful: The difference between y = 
and y = 1 is not comparable to the difference between y = —5 and y = —4. For this reason we use 
the following ordered probit model to relate yij^t to Xij-^^: 

yij^t = max{k : Zi^j^t > Ck,k £ {-5,-4,. .. ,+l,+2}}. 

In this model the parameters to estimate include the regression coefficients (3 and the cutoffs 
c = (c_4, . . . , c+2), with c_5 = —00. The usual probit regression model would assume the 7i,j,t's are 
independent standard normal variables (standard, as the scale of these error terms is not separately 



identifiable from (3 and c). However, results of Ward et al. 2007 suggest that the residuals from 



regression models of international relations data are generally not patternless. For example, we 
might expect 7j,i,i, . . . ,7i,66,t to exhibit statistical correlation, as these residuals are all associated 
with country i. More subtle might be higher order patterns common in relational data: If i and j 
have a positive relationship and j and k have a positive relationship, then a positive relationship 
between i and k is more likely. 



Hoff 2008 describes how two-way factor models can be used to represent patterns in ordinal 



matrix-valued relational and social network data. Here we extend this idea, using a three-way 
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factor model to represent the longitudinal relational patterns represented by the array T = {jij^t}- 
Specifically, the following factor model is proposed: 

li,j,t = (ui,Uj,Vi) + ejj,i , with 
{(^i,j,t = (^j,i,t} ~ normal(0, 1). 

The Uj's are vectors representing heterogeneity among the countries and the vj's represent het- 
erogeneity over time. This is a modification of the usual three-way PARAFAC representation 
to accommodate the fact that the data are symmetric {yij,t = yj,i,t)- This model has a simple 
interpretation: Letting Tt = {jij^t '■ ihj) £ {Ij • • ■ j"*-}^}) we have 

Tt = UAiU^ + Et , where U = (ui, . . . , u^)'^ and At = diag(vt). 

This symmetric version of the PARAFAC model is analogous to a type of eigenvalue decomposition 
of the collection of square matrices {Figso, . . . ,ri985} in which the eigenvectors are held constant 
across matrices, but the eigenvalues are allowed to vary. The resulting matrices UAtU are then 
each symmetric and of rank R. Heterogeneity across countries is determined by the rows of U, and 
heterogeneity across time is determined by the Aj's. 

The unobserved quantities in this model include the latent variable array Z as well as the pa- 
rameters U, V and (3. Using the same hierarchical prior distributions for U and V described in 
Section 3.2 and a diffuse multivariate normal(0, 100 x I) prior distribution for /3, we can implement 
a Gibbs sampler to approximate the joint posterior distribution j5(Z,U, V,/3|Y,X). All full con- 
ditionals are standard, and are available from the supplementary material at the author's website. 
Using a rank-2 model, the Gibbs sampler was run for 505,000 iterations, dropping the first 5,000 
to allow for burn-in and then saving the parameter values every 10th iteration. Convergence of 
the Markov chain was monitored via the sampled values of (3. The effective sample sizes for the 
four regression coefficients based on the 50,000 saved scans were 12,548, 16,622, 1,386 and 8,878 
respectively. 

The plots in Figure[5]show the marginal posterior distributions of the four regression coefficients, 
along with 95% highest posterior density confidence intervals. The results indicate a negative 
association between gdp and the latent variable z, refiecting the fact that a majority of the confiicts 
over the cold war period involved economically large countries. The plots in the second row indicate 
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Figure 5: Posterior densities for the elements of f3. Gray lines are 95% confidence intervals. 



that Zij tends to be larger if both i and j have polity scores of the same sign, but that there is not 
strong evidence for a further increase if the polities of i and j are both positive. 

Figure |6] displays a summary of the posterior distribution of U and V. This summary was 
obtained as follows: First, a Monte Carlo approximation of the posterior mean of the three- 
way array = (U, U, V) was obtained using the values generated from the Markov chain. The 
alternating least-squares algorithm was then applied to to obtain values U and V. The columns 
of U were normalized to be unit vectors, and the columns of V were then rescaled accordingly. The 
columns of the latent factor matrices were then permuted so that the magnitude of the columns 
of V were in decreasing order. The resulting values are plotted in Figure [6J The large square plot 
shows the estimates of the two-dimensional latent factor vectors {uj} for each country, with a larger 
font used for those countries with larger vectors. The second column gives the values of vt^k, sorted 
chronologically. Since all of these values are positive, two latent vectors {uj^,Uj2} being in similar 
directions indicates a tendency for countries ii and «2 to cooperate militarily, whereas vectors in 
opposite directions indicate a tendency for conflict. For example, the vectors corresponding to USA 
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Figure 6: Posterior estimates of the country- and time-specific factors. 

and South Korea (ROK) are similar to each other and in the opposite direction of China (CHN) 
and North Korea (PRK). The heterogeneity of the vt's over time allows for different patterns of 
conflict across the years. For example, cooperation and conflict in 1980 and 1985 are described 
primarily by the first dimension of the factors (ui), whereas events in 1955 and 1975 primary by 
the second (1^2)- 

7 Discussion 

This article has presented a hierarchical version of a reduced-rank multilinear model for array data 
and a Bayesian method for parameter estimation. Unlike least-squares estimation, a Bayesian 
approach allows for regularized estimates of the potentially large number of parameters in a mul- 
tilinear model. Unlike a non-hierarchical Bayesian approach, the hierarchical approach provides 
a data-driven method of regularization, and a more flexible representation of the patterns in the 
data array. Additionally, in a simulation study the estimates provided by the hierarchical approach 
showed robustness to rank misspecification, as compared those obtained from a least-squares or 
non-hierarchical approach. 
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Another advantage of the Bayesian approach is that it aUows for the incorporation of multihnear 
structure into a broad class of statistical models. For example, a least-squares approach would be 
inappropriate for the ordinal cooperation and conflict data in Section 6, but Bayesian estimation for 
these data, using a probit model with multilinear effects, is relatively straightforward. As another 
example, the survey data presented in Section 5 was not in the form of an array, but the cell 
means corresponding to the 128 levels of the 4 categorical variables can be represented as such. A 
reduced-rank multilinear model provides a parsimonious representation of the cell means, but also 
is more flexible than a simple additive effects model. 

An important line of future research is the study of the theoretical properties of hierarchical 
Bayesian approaches to parameter estimation for multiway data arrays. For a matrix model in 
which Y = -|- E and E is a matrix of normally-distributed noise, Tsukuma [2008 2009 studies 



Bayesian and hierarchical Bayesian approaches to providing admissible and minimax estimates of 
0. One aspect of this work shows that under certain prior distributions on the singular vectors of 0, 
the Bayes estimates are equivariant and can be obtained by shrinking the singular values of Y. Such 
estimates are somewhat analogous to those presented in this article for multiway data, as shrinking 
the singular values of a matrix is similar to regularizing the variance of a set of multiplicative 
factors. The author is currently investigating the extent to which such similarities between the 
matrix and array models lead to similar theoretical properties of Bayesian estimates in the two 
cases. Additionally, hierarchical Bayesian procedures, like the one in this article, often produce 
estimates similar to those from empirical Bayes and James-Stein procedures, which have been shown 



to outperform the least-squares criterion in a variety of multivariate estimation problems [James 



and Stein, 1961, Efron and Morris 1973 . It seems likely that estimators from such shrinkage 
procedures will enjoy similar advantages over least squares estimation as the hierarchical model 
presented in this article. 

A popular alternative approach to shrinkage estimation for high-dimensional models is based 



on Li penalization Tibshirani, 1996 , in which an estimate is obtained by minimizing the residual 
sum of squares plus an Li penalty on the parameter values. In the context of estimating a three- 
way array 0, this could mean obtaining the value that minimizes ||Y — 0|p + A^j , j^ \di,j,k\- 
However, the multiplicative parameterization of O-ij^k = X]r=i ~ 'Ui^rVj^rWk^r rnakes this opti- 
mization problem difficult. Alternatively, minimization of IIY — 01 P + V- ^ u ^XrWirVjrWkA or 
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||Y — 0|p + Ai Ylir l^^.'^'l + -^2 Xli r l^j.^l + '^3 Sfe r l^fc,r| would be feasible via modifications to the 
ALS procedure. The latter criterion provides estimators that are equivalent to posterior modes un- 
der double exponential prior distributions, and is similar to a criterion used for matrix estimation 
by Witten et al. [2009 , as an alternative to least-squares estimation via the SVD. However, unlike 



a hierarchical modeling approach, such Li-penalized estimators always shrink towards zero, and 
do not take advantage of the potential variances and correlations in Y (such as those described by 
Equation HI) that could improve estimation of 0. 

Replication code and data for the numerical results in this paper are available at the author's 



website: http: //www. stat .Washington. edu/~hoff 
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